Abstract

Background: Chromatin immunoprecipitation coupled with hybridization to a tiling array (ChIP-chip) is a
cost-effective and routinely used method to identify protein-DNA interactions or chromatin/histone modifications.
The robust identification of ChIP-enriched regions is frequently complicated by noisy measurements. This
identification can be improved by accounting for dependencies between adjacent probes on chromosomes and by
modeling biological replicates.

Results: MultiChIPmixHMM is a user-friendly R package for analysing ChIP-chip data that models spatial dependencies
between directly adjacent probes on a chromosome and enables a simultaneous analysis of replicates. It is based on
a linear regression mixture model, designed to perform a joint modeling of immunoprecipitated and input
measurements.

Conclusion: We show the utility of MultiChIPmixHMM by analyzing histone modifications of Arabidopsis thaliana.
MultiChIPmixHMM is implemented in R, with functions written in C, and is freely available from the CRAN web site:
http://cran.r-project.org.



Background 

Chromatin immunoprecipitation coupled with hybridization to a tiling
array (ChIP-chip) is a cost-effective and routinely used method for
identifying target genes of transcription factors, for analyzing
histone modifications or for studying the methylome on a genome-wide
scale [1]. In a ChIP-chip experiment, a chromatin immunoprecipitation
sample (IP) is compared against a reference sample of genomic DNA
(Input). In recent years, different methods for the identification of
ChIP-enriched regions have been developed. Among them, [2] proposed a
linear regression mixture model named ChIPmix, designed to perform a
joint modeling of IP and Input measurements. This two-component mixture
model discriminates the population of enriched probes from non-enriched
ones. Over the last years, ChIPmix has successfully been applied to the
identification of methylated gene promoters, histone modifications or
transcription factor target genes (e.g. [3-7]). However, ChIPmix has
two important limitations: it does not model spatial dependencies
between adjacent probes on chromosomes, and it does not handle the
joint analysis of multiple biological replicates.

Here, we present MultiChIPmixHMM for ChIP-chip analyses, which enables
modeling of spatial dependencies and a simultaneous analysis of
replicates to further improve the identification of enriched probes. We
demonstrate improved performance of MultiChIPmixHMM compared to ChIPmix
for the target identification of the chromatin mark H3K27me3 of the
model plant Arabidopsis thaliana.

Implementation 

MultiChIPmixHMM is based on a two-state first-order Hidden Markov Model
(HMM) with state-specific Gaussian emission distributions modeling
immunoprecipitated signals as a linear regression of reference input
signals. Let $(x_{tr}, y_{tr})$ be the pair of log-Input and log-IP
intensities of probe $t$ measured in replicate $r$ of a ChIP-chip
experiment. The hidden state of probe $t$ is modeled by
$Z_t \in \{0, 1\}$ to distinguish enriched ($Z_t = 1$) from
non-enriched probes ($Z_t = 0$). The Gaussian emission density of state
$Z_t$ modeling $R$ replicates is given by a product of independent
Gaussian distributions

$$f(y_{t1}, \ldots, y_{tR} \mid Z_t) = \prod_{r=1}^{R} \phi(y_{tr};\, a_{Z_t r} + b_{Z_t r} x_{tr},\, \sigma_r^2)$$

with specific mean $a_{Z_t r} + b_{Z_t r} x_{tr}$ and variance
$\sigma_r^2$ for each replicate $r \in \{1, \ldots, R\}$, where
$\phi(\cdot;\, \mu, \sigma^2)$ denotes the Gaussian density with mean
$\mu$ and variance $\sigma^2$. Dependencies between adjacent genomic
probes $t$ and $t + 1$ are modeled by a first-order Markov chain, so
that the next state $Z_{t+1}$ depends on the predecessor state $Z_t$.
All parameters of the HMM are estimated using the Baum-Welch algorithm
[8], which is a special case of the EM algorithm [9].
To obtain relevant initial values of the emission distribu- 
tion parameters (slopes and intercepts of the regressions), 
we applied a Principal Component Analysis to each bio- 
logical replicate and used the first axis to derive the 
intercept and slope of the regression. All initial transition 
parameters are set to 0.5. This reflects the typical case 
where no biological information is available. We observed 
on simulations that alternative choices for the transition 
matrix initialization lead to similar results (not shown). 
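
As a rough illustration of this kind of initialization, the sketch below derives an intercept and a slope from the first principal axis of the (log-Input, log-IP) cloud of one replicate. The function name `first_axis_line` is hypothetical and not part of the package, and how the first axis is turned into separate starting values for the two states is not specified here, so that step is not reproduced.

```python
import math

def first_axis_line(x, y):
    """Intercept and slope of the first principal axis of the points (x[i], y[i])."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Entries of the 2x2 covariance matrix of (x, y).
    sxx = sum((xi - mean_x) ** 2 for xi in x) / n
    syy = sum((yi - mean_y) ** 2 for yi in y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    if sxy == 0.0:
        raise ValueError("zero covariance: degenerate case not handled in this sketch")
    # Largest eigenvalue of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    lam = (sxx + syy) / 2.0 + math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    # An eigenvector for lam is (sxy, lam - sxx); its direction gives the slope.
    slope = (lam - sxx) / sxy
    # The first axis passes through the centroid of the cloud.
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Illustrative data lying exactly on the line y = 1 + 0.8 * x.
log_input = [0.0, 1.0, 2.0, 3.0]
log_ip = [1.0 + 0.8 * v for v in log_input]
intercept, slope = first_axis_line(log_input, log_ip)
```

Unlike an ordinary least-squares fit, the first principal axis treats both coordinates symmetrically, which may be one reason to prefer it when only starting values are needed.
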
Identification of enriched probes is based on conditional
probabilities. A probe is declared enriched if its enriched conditional
probability (state-posterior probability of the enriched state) is
higher than 1 - α, where α is chosen by the user. This strategy has
been shown to control the proportion of misclassifications in mixture
models [10].
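
The following is a minimal sketch, not the package's C implementation, of how state-posterior probabilities can be obtained for a two-state HMM with regression-type Gaussian emissions once the parameters are known, and how the 1 - α rule is then applied. It uses a standard scaled forward-backward recursion; all function names, the uniform initial distribution and the toy parameter values are assumptions made for illustration.

```python
import math

def gaussian_pdf(y, mean, var):
    """Density of a Gaussian with the given mean and variance, evaluated at y."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def emission(x_t, y_t, state, a, b, var):
    """Product over replicates r of N(y_tr; a[state][r] + b[state][r] * x_tr, var[r])."""
    dens = 1.0
    for r in range(len(x_t)):
        dens *= gaussian_pdf(y_t[r], a[state][r] + b[state][r] * x_t[r], var[r])
    return dens

def enriched_posteriors(x, y, init, trans, a, b, var):
    """Posterior probability of state 1 (enriched) per probe, by scaled forward-backward."""
    n = len(x)
    e = [[emission(x[t], y[t], k, a, b, var) for k in (0, 1)] for t in range(n)]
    # Forward pass, normalised at each probe to avoid numerical underflow.
    alpha, scale = [], []
    for t in range(n):
        if t == 0:
            raw = [init[k] * e[0][k] for k in (0, 1)]
        else:
            raw = [e[t][k] * sum(alpha[t - 1][j] * trans[j][k] for j in (0, 1))
                   for k in (0, 1)]
        c = sum(raw)
        scale.append(c)
        alpha.append([v / c for v in raw])
    # Backward pass, reusing the forward scaling constants.
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [sum(trans[k][j] * e[t + 1][j] * beta[t + 1][j] for j in (0, 1))
                   / scale[t + 1] for k in (0, 1)]
    post = []
    for t in range(n):
        g0, g1 = alpha[t][0] * beta[t][0], alpha[t][1] * beta[t][1]
        post.append(g1 / (g0 + g1))
    return post

def call_enriched(posteriors, alpha_level):
    """Declare a probe enriched when its posterior exceeds 1 - alpha_level."""
    return [p > 1.0 - alpha_level for p in posteriors]

# Toy example: 6 probes, 2 replicates; the last three probes lie on the "enriched" line.
a = [[0.0, 0.0], [0.0, 0.0]]        # intercepts a[state][replicate] (illustrative)
b = [[0.6, 0.6], [0.99, 0.99]]      # slopes b[state][replicate] (illustrative)
var = [0.7, 0.75]                   # one variance per replicate (illustrative)
trans = [[0.97, 0.03], [0.1, 0.9]]  # transition matrix (illustrative)
x = [[10.0, 10.0]] * 6
y = [[6.0, 6.0]] * 3 + [[9.9, 9.9]] * 3
post = enriched_posteriors(x, y, [0.5, 0.5], trans, a, b, var)
calls = call_enriched(post, 0.01)
```

In a full Baum-Welch fit, the same forward and backward quantities would also feed the parameter updates; that step is omitted here.
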



Results and discussion 

Simulations 

In this section, we first compare ChIPmix, MultiChIPmixHMM and TileHMM
[11], which is an HMM-based method to analyze the log-ratios (IP over
Input). Moreover, TileHMM can handle multiple replicates. We simulated
data according to a two-state HMM with state-specific Gaussian emission
distributions modeling immunoprecipitated signals as a linear
regression of reference input signals. We considered two test
scenarios: (i) well-separated non-enriched and enriched probes (slope
parameters 0.6 and 0.99) and (ii) overlapping populations of
non-enriched and enriched probes (slope parameters 0.5 and 0.65). Two
biological replicates are simulated for each scenario. The transition
matrix is set to

$$\begin{pmatrix} 0.97 & 0.03 \\ 0.1 & 0.9 \end{pmatrix}$$

and variances are set to 0.7 for the first replicate and 0.75 for the
second. We used the corresponding method-specific conditional
probabilities for probes to be enriched to display ROC curves. For
ChIPmix, which returns a set of probe conditional probabilities per
replicate, we summarized the results by taking either the minimal or
the maximal conditional probability over the two replicates.
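
A simulation of this kind can be sketched as follows. The slopes, transition matrix and per-replicate variances are those stated above; the intercepts, the distribution of the log-Input intensities, the starting state, the number of probes and the function name `simulate_chip_chip` are assumptions introduced only to make the example runnable.

```python
import math
import random

def simulate_chip_chip(n_probes, slopes, trans, variances, intercepts=(0.0, 0.0),
                       input_mean=10.0, input_sd=1.0, seed=0):
    """Simulate hidden states and (log-Input, log-IP) pairs from a two-state HMM.

    intercepts, input_mean and input_sd are illustrative assumptions; only the
    slopes, transition matrix and per-replicate variances come from the text.
    """
    rng = random.Random(seed)
    n_rep = len(variances)
    states, x, y = [], [], []
    state = 0  # assumption: the chain starts in the non-enriched state
    for t in range(n_probes):
        if t > 0:
            # Move to state 1 with probability trans[state][1], else to state 0.
            state = 1 if rng.random() < trans[state][1] else 0
        # Independent log-Input intensities for each replicate.
        x_t = [rng.gauss(input_mean, input_sd) for _ in range(n_rep)]
        # log-IP = state-specific regression on log-Input plus Gaussian noise.
        y_t = [rng.gauss(intercepts[state] + slopes[state] * x_t[r],
                         math.sqrt(variances[r])) for r in range(n_rep)]
        states.append(state)
        x.append(x_t)
        y.append(y_t)
    return states, x, y

TRANS = [[0.97, 0.03], [0.1, 0.9]]
VARIANCES = [0.7, 0.75]
# Scenario (i): well-separated populations; scenario (ii): overlapping populations.
scenario_1 = simulate_chip_chip(1000, (0.6, 0.99), TRANS, VARIANCES, seed=1)
scenario_2 = simulate_chip_chip(1000, (0.5, 0.65), TRANS, VARIANCES, seed=1)
```

Because the true hidden states are returned alongside the intensities, they can serve as ground truth when building ROC curves from each method's conditional probabilities.
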

On the ROC curves, we can observe that MultiChIPmixHMM outperforms the
other methods in both scenarios (cf. Figure 1). We further analyse the
results after classification by choosing a level α = 0.01.

The comparison is performed in Table 1. While conservative, ChIPmix and
MultiChIPmixHMM correctly control the proportion of FP at the required
0.01 level. On the contrary, TileHMM results in a higher TP rate, but
at the price of an FP rate more than ten times higher than the required
level.

Figure 1 ROC curves. ROC curves for ChIPmix, MultiChIPmixHMM and TileHMM, for the two simulated scenarios. In
ChIPmix_Union, the minimal value of the conditional probability over the replicates is considered. In
ChIPmix_Inter, this is the maximal value.

Table 1 Comparison of ChIPmix, MultiChIPmixHMM and TileHMM after classification

Scenario 1, classification with α = 0.01

Method                  False positive rate   True positive rate
ChIPmix Union           3.79e-04              0.32
ChIPmix Intersection    0                     0.1
MultiChIPmixHMM         1.12e-04              0.42
TileHMM                 0.13                  0.83
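
For reference, rates of the kind reported in Table 1 can be computed from simulated ground truth as sketched below, assuming the standard definitions (false positive rate as the fraction of truly non-enriched probes declared enriched, true positive rate as the fraction of truly enriched probes declared enriched). The function name and the toy vectors are illustrative and are not taken from the study.

```python
def classification_rates(true_states, calls):
    """Return (false positive rate, true positive rate) for binary enrichment calls.

    true_states: list of 0 (non-enriched) or 1 (enriched) per probe.
    calls: list of booleans, True when the probe was declared enriched.
    """
    n_neg = sum(1 for z in true_states if z == 0)
    n_pos = len(true_states) - n_neg
    if n_neg == 0 or n_pos == 0:
        raise ValueError("both enriched and non-enriched probes are needed")
    # Non-enriched probes wrongly declared enriched.
    fp = sum(1 for z, c in zip(true_states, calls) if z == 0 and c)
    # Enriched probes correctly declared enriched.
    tp = sum(1 for z, c in zip(true_states, calls) if z == 1 and c)
    return fp / n_neg, tp / n_pos

# Toy example with five probes.
truth = [0, 0, 0, 1, 1]
declared = [False, True, False, True, False]
fpr, tpr = classification_rates(truth, declared)
```
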



Arabidopsis dataset analysis 

To illustrate the benefit of using MultiChIPmixHMM compared to standard
ChIPmix, we use a normalized ChIP-chip data set of the model plant
Arabidopsis thaliana by [6] to compare the identification of genomic
regions marked by histone H3 tri-methylated at lysine 27 (H3K27me3). We
applied both methods to analyze the two biological replicates and
identified probes enriched in H3K27me3 using a stringent cutoff of
1 - α = 0.99. Since ChIPmix does not handle multiple replicates, both
replicates were analyzed separately and only probes declared as
enriched in both replicates were finally considered as enriched
(considering probes declared enriched for at least one of the
replicates leads to similar results).



Considering the decodings of individual probes, ChIPmix and
MultiChIPmixHMM provide the same status prediction (non-enriched or
enriched) for more than 90% of the probes. Focusing on enriched probes,
all the 8,100 probes identified by ChIPmix are also included in the set
of enriched probes identified by MultiChIPmixHMM. MultiChIPmixHMM also
identified 7,940 additional probes enriched in H3K27me3. In good
agreement with previous findings [6], we find that probes marked by
H3K27me3 are preferentially associated with genes. ChIPmix found about
3000 enriched probes associated with genes, while MultiChIPmixHMM found
approximately 2000 more. Among these 2000 additional probes, about 1500
complete regions already found by ChIPmix, while 536 probes concern 254
new genes. We further analyzed the 379 genes targeted by H3K27me3 that
have been identified by both methods. Considering MultiChIPmixHMM,
these genes are covered by 1616 enriched probes compared to only 939
enriched probes identified by ChIPmix. Thus, the modeling of spatial
dependencies between probes by MultiChIPmixHMM leads to a better
modeling of enriched probes along genes. Furthermore, MultiChIPmixHMM
identified 254 new target genes. This is illustrated in Figures 2 and
3, where additional probes identified as enriched by MultiChIPmixHMM
extend or complete enriched regions identified by ChIPmix.
Figure 2 Comparison of ChIPmix and MultiChIPmixHMM. Comparison of ChIPmix and MultiChIPmixHMM illustrated for a selected region on
chromosome 4. Probes identified as enriched are shown in red. Non-enriched probes are displayed in black. Blue bars correspond to the location of
genes.

Figure 3 Comparison of ChIPmix and MultiChIPmixHMM. Example of one known H3K27me3 target gene identified only with
MultiChIPmixHMM. The second line corresponds to the log-ratio signal, and for the two last lines, the scale corresponds to the conditional
probabilities. We note that the conditional probabilities are clearly higher with MultiChIPmixHMM.



To further validate these findings, we use known H3K27me3 target genes
based on independent prior studies by [12] and by [13]. Among the 311
genes found by both studies, 298 were commonly identified by ChIPmix
and MultiChIPmixHMM. Additionally, MultiChIPmixHMM identifies 11 genes
exclusively, which have already been identified as target genes in at
least one of the two studies. Importantly, this increase in detection
power comes without additional computational time, because the main
algorithm of MultiChIPmixHMM is implemented in C.

Conclusions 

The R package MultiChIPmixHMM implements a linear regression mixture
model to analyse ChIP-chip data. In order to provide a more accurate
identification of enriched probes, it takes into account spatial
dependencies between directly adjacent probes and enables a
simultaneous analysis of replicates. The benefits of MultiChIPmixHMM
have been shown by analyzing both simulated and real datasets, and by
comparison with competing software.

Availability and requirements 

MultiChIPmixHMM is publicly available as an R package from CRAN [14].
Two functions are implemented and refer to the models described above.
To distinguish between the model and the function, the first letter of
the function name is lower case: (i) multiChIPmixHMM for modeling
spatial dependencies and multiple replicates and (ii) multiChIPmix to
model multiple replicates ignoring spatial dependencies between probes.
Both functions take as input a vector of filenames (one biological
replicate per file), and produce as output a file containing the
enriched conditional probability and status of each probe.

• Project name: MultiChIPmixHMM

• Project home page: http://cran.r-project.org/web/packages/MultiChIPmixHMM/index.html

• Operating system(s): platform independent 

• Programming language: R and C 

• Other requirements: None

• License: GNU GENERAL PUBLIC LICENSE 

• Any restrictions to use by non-academics: it is 
available for free download. 